Midterm Project: Effects of Air Pollution on Countries

1. Introduction

Motivation

Air pollution affects blah blah blah… and with the increased worsening in climate and air quality, blah blah blah…. Our group wanted to explore how air pollution has changed over time and how it affects countries differently. Specifically, we wanted to analyze whether a country’s economic and social position increases, decreases, or has no observable impact on the effects of air pollution. In layman’s terms, does air pollution affect underdeveloped countries disproportionately?

Set Up

Before we start, we need to ensure that we have all the relevant libraries installed and imported.

Run these in the console to install packages in addition to the ezids package.

install.packages("tidyverse")
install.packages("rworldmap")
install.packages("tmap")
install.packages("spData")
install.packages("sf")
install.packages("ggpubr")
install.packages("dplyr")

2. Data Sources and Data Wrangling

Data Sources

For our analysis, we will be working with 5 main data sources shown in the table below:

Figure 1: Data Sources
Data Source | Link
Deaths Due to Air Pollution of Countries from 1990 - 2017 | Kaggle Link
GDP Annual Growth of Countries from 1960 - 2020 | Kaggle via WorldBank Link
United Nations Population and Region Data | United Nations Link
United Nations ISO-alpha3 code | United Nations Link
spData for Map Geometries | spData for Mapping Link

The main variables in our datasets will include:

Figure 2: Key Variables
Feature | Data Type | Unit of Measure | Notes and Assumptions
GDP (Gross Domestic Product) | Numerical, Continuous | $USD | This is our chosen proxy for measuring a country’s economic status.
Population Size | Numerical, Continuous | thousands of people | Annual UN estimate.
Deaths due to Air Pollution | Numerical, Continuous | deaths per 100,000 people | This is our chosen proxy for measuring the negative effects of air pollution.
Country | Qualitative, Categorical | N/A | 231 countries.
SDG Region | Qualitative, Categorical | N/A | UN’s Sustainable Development Goals Region Classification.
Sub Region | Qualitative, Categorical | N/A | UN’s Sustainable Development Goals Sub-Region Classification.
ISO-alpha3 Country Code | Qualitative, Categorical | N/A | Standard for identifying countries (text ID).
ISO-alpha2 Country Code | Qualitative, Categorical | N/A | Another standard for identifying countries (text ID).
M49 Country Code | Numerical, Categorical | N/A | Another standard for identifying countries (numerical ID).
Year | Numerical, Categorical | N/A | 1990 to 2017.
GDP per Capita | Numerical, Continuous | $USD per person | Normalization of GDP to compare between population sizes (calculated).

Data Wrangling

While the data from Kaggle were already in a format ready to be cleaned, the data downloaded from the United Nations required some wrangling. Mainly, we needed to extract only the country-level data from the Excel workbooks into their own self-contained CSV files. Since we only needed to do this once, and selecting the specific cells programmatically would have taken significant time, we opted to perform this step in Excel rather than in R. Note that if this were part of a real production data pipeline, we would take the time to program the data extraction, but we would likely choose a different language such as Python, which we consider better suited to tasks like web scraping and data transformation with pandas.
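
If this extraction step were scripted in R instead, it might look like the sketch below. It is illustrative only: the two banner lines, the inline CSV text, and the WORLD row are made-up stand-ins for the UN workbook layout. Only the Type column with its "Country/Area" value and the two 1990 population figures are taken from the loaded data shown later in this report.

```r
# Hypothetical export of a UN workbook: two banner lines, then the real header row.
# The WORLD row and its value (0) are placeholders, not real data.
raw_text <- paste(
  "United Nations banner line (illustrative)",
  "Population estimates (illustrative)",
  "Region,Country,Type,X1990",
  "WORLD,WORLD,World,0",
  "SUB-SAHARAN AFRICA,Burundi,Country/Area,5439",
  "SUB-SAHARAN AFRICA,Comoros,Country/Area,412",
  sep = "\n"
)

# skip = 2 drops the banner lines so the third line is read as the header.
raw_df <- read.csv(text = raw_text, skip = 2, stringsAsFactors = FALSE)

# Keep only country-level rows and discard regional aggregates.
countries_df <- raw_df[raw_df$Type == "Country/Area", ]
print(countries_df)
```

For a real workbook, the number of lines to skip and the set of non-country row types would have to be checked by hand first, which is part of why a one-off manual extraction was the cheaper option here.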

UN Data Sample Messy
Figure 3: Sample screenshot of data downloaded from UN including unnecessary elements like banners and other regional data.
UN Data Sample Cleaned
Figure 4: Sample screenshot of transformed UN dataset.

3. Load, Clean, and Inspect Data

Load Data

## 'data.frame':    249 obs. of  4 variables:
##  $ Country.or.Area: chr  "Andorra" "United Arab Emirates (the)" "Afghanistan" "Antigua and Barbuda" ...
##  $ ISO.alpha2.code: chr  "AD" "AE" "AF" "AG" ...
##  $ ISO.alpha3.code: chr  "AND" "ARE" "AFG" "ATG" ...
##  $ M49.code       : int  20 784 4 28 660 8 51 24 10 32 ...
## 'data.frame':    6468 obs. of  7 variables:
##  $ Entity                                         : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Code                                           : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ Year                                           : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ Air.pollution..total...deaths.per.100.000.     : num  299 291 279 279 287 ...
##  $ Indoor.air.pollution..deaths.per.100.000.      : num  250 243 232 232 239 ...
##  $ Outdoor.particulate.matter..deaths.per.100.000.: num  46.4 46 44.2 44.4 45.6 ...
##  $ Outdoor.ozone.pollution..deaths.per.100.000.   : num  5.62 5.6 5.61 5.66 5.72 ...
## 'data.frame':    264 obs. of  66 variables:
##  $ Country.Name  : chr  "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ Country.Code  : chr  "ABW" "AFG" "AGO" "ALB" ...
##  $ Indicator.Name: chr  "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" ...
##  $ Indicator.Code: chr  "NY.GDP.MKTP.CD" "NY.GDP.MKTP.CD" "NY.GDP.MKTP.CD" "NY.GDP.MKTP.CD" ...
##  $ X1960         : num  NA 537777811 NA NA NA ...
##  $ X1961         : num  NA 548888896 NA NA NA ...
##  $ X1962         : num  NA 546666678 NA NA NA ...
##  $ X1963         : num  NA 751111191 NA NA NA ...
##  $ X1964         : num  NA 800000044 NA NA NA ...
##  $ X1965         : num  NA 1006666638 NA NA NA ...
##  $ X1966         : num  NA 1399999967 NA NA NA ...
##  $ X1967         : num  NA 1673333418 NA NA NA ...
##  $ X1968         : num  NA 1373333367 NA NA NA ...
##  $ X1969         : num  NA 1408888922 NA NA NA ...
##  $ X1970         : num  NA 1748886596 NA NA 78619206 ...
##  $ X1971         : num  NA 1831108971 NA NA 89409820 ...
##  $ X1972         : num  NA 1595555476 NA NA 113408232 ...
##  $ X1973         : num  NA 1733333264 NA NA 150820103 ...
##  $ X1974         : num  NA 2155555498 NA NA 186558696 ...
##  $ X1975         : num  NA 2366666616 NA NA 220127246 ...
##  $ X1976         : num  NA 2555555567 NA NA 227281025 ...
##  $ X1977         : num  NA 2953333418 NA NA 254020153 ...
##  $ X1978         : num  NA 3300000109 NA NA 308008898 ...
##  $ X1979         : num  NA 3697940410 NA NA 411578334 ...
##  $ X1980         : num  NA 3641723322 5930503401 NA 446416106 ...
##  $ X1981         : num  NA 3478787909 5550483036 NA 388958731 ...
##  $ X1982         : num  NA NA 5550483036 NA 375895956 ...
##  $ X1983         : num  NA NA 5784341596 NA 327861833 ...
##  $ X1984         : num  NA NA 6131475065 1857338012 330070689 ...
##  $ X1985         : num  NA NA 7553560459 1897050133 346737965 ...
##  $ X1986         : num  405463417 NA 7072063345 2097326250 482000594 ...
##  $ X1987         : num  487602458 NA 8083872012 2080796250 611316399 ...
##  $ X1988         : num  596423607 NA 8769250550 2051236250 721425939 ...
##  $ X1989         : num  695304363 NA 10201099040 2253090000 795449332 ...
##  $ X1990         : num  764887117 NA 11228764963 2028553750 1029048482 ...
##  $ X1991         : num  872138715 NA 10603784541 1099559028 1106928583 ...
##  $ X1992         : num  958463184 NA 8307810974 652174991 1210013652 ...
##  $ X1993         : num  1082979721 NA 5768720422 1185315468 1007025755 ...
##  $ X1994         : num  1245688268 NA 4438321017 1880951520 1017549124 ...
##  $ X1995         : num  1320474860 NA 5538749260 2392764853 1178738991 ...
##  $ X1996         : num  1379960894 NA 7526446606 3199642580 1223945357 ...
##  $ X1997         : num  1531944134 NA 7648377413 2258515610 1180597273 ...
##  $ X1998         : num  1665100559 NA 6506229607 2545967253 1211932398 ...
##  $ X1999         : num  1722798883 NA 6152922943 3212119044 1239876305 ...
##  $ X2000         : num  1873452514 NA 9129594819 3480355189 1429049198 ...
##  $ X2001         : num  1920111732 NA 8936063723 3922099471 1546926174 ...
##  $ X2002         : num  1941340782 4055179566 15285594828 4348070165 1755910032 ...
##  $ X2003         : num  2021229050 4515558808 17812705294 5611492283 2361726862 ...
##  $ X2004         : num  2228491620 5226778809 23552052408 7184681399 2894921778 ...
##  $ X2005         : num  2330726257 6209137625 36970918699 8052075642 3159905484 ...
##  $ X2006         : num  2424581006 6971285595 52381006892 8896073938 3456442103 ...
##  $ X2007         : num  2615083799 9747879532 65266452081 10677321490 3952600602 ...
##  $ X2008         : num  2745251397 10109225814 88538611205 12881354104 4085630584 ...
##  $ X2009         : num  2498882682 12439087077 70307163678 12044223353 3674409558 ...
##  $ X2010         : num  2390502793 15856574731 83799496611 11926928506 3449966857 ...
##  $ X2011         : num  2549720670 17804292964 111789686464 12890765324 3629203786 ...
##  $ X2012         : num  2534636872 20001598506 128052853643 12319830252 3188808943 ...
##  $ X2013         : num  2701675978 20561069558 136709862831 12776217195 3193704343 ...
##  $ X2014         : num  2765363128 20484885120 145712200313 13228144008 3271808157 ...
##  $ X2015         : num  2919553073 19907111419 116193649124 11386846319 2789870188 ...
##  $ X2016         : num  2965921788 18017749074 101123851090 11861200797 2896679212 ...
##  $ X2017         : num  3056424581 18869945678 122123822334 13019693451 3000180750 ...
##  $ X2018         : num  NA 18353881130 101353230785 15147020535 3218316013 ...
##  $ X2019         : num  NA 19291104008 88815697793 15279183290 3154057987 ...
##  $ X2020         : logi  NA NA NA NA NA NA ...
##  $ X             : logi  NA NA NA NA NA NA ...
## 'data.frame':    235 obs. of  78 variables:
##  $ SDGRegion   : chr  "SUB-SAHARAN AFRICA" "SUB-SAHARAN AFRICA" "SUB-SAHARAN AFRICA" "SUB-SAHARAN AFRICA" ...
##  $ SubRegion   : chr  "Eastern Africa" "Eastern Africa" "Eastern Africa" "Eastern Africa" ...
##  $ Country     : chr  "Burundi" "Comoros" "Djibouti" "Eritrea" ...
##  $ Notes       : int  NA NA NA NA NA NA NA NA 1 2 ...
##  $ Country.code: int  108 174 262 232 231 404 450 454 480 175 ...
##  $ Type        : chr  "Country/Area" "Country/Area" "Country/Area" "Country/Area" ...
##  $ Parent.code : int  910 910 910 910 910 910 910 910 910 910 ...
##  $ X1950       : chr  "  2 309" "   159" "   62" "   822" ...
##  $ X1951       : chr  "  2 360" "   163" "   63" "   835" ...
##  $ X1952       : chr  "  2 406" "   167" "   65" "   849" ...
##  $ X1953       : chr  "  2 449" "   170" "   66" "   865" ...
##  $ X1954       : chr  "  2 492" "   173" "   68" "   882" ...
##  $ X1955       : chr  "  2 537" "   176" "   70" "   900" ...
##  $ X1956       : chr  "  2 585" "   179" "   71" "   919" ...
##  $ X1957       : chr  "  2 636" "   182" "   74" "   939" ...
##  $ X1958       : chr  "  2 689" "   185" "   76" "   961" ...
##  $ X1959       : chr  "  2 743" "   188" "   80" "   983" ...
##  $ X1960       : chr  "  2 798" "   191" "   84" "  1 008" ...
##  $ X1961       : chr  "  2 852" "   194" "   89" "  1 033" ...
##  $ X1962       : chr  "  2 907" "   197" "   94" "  1 060" ...
##  $ X1963       : chr  "  2 964" "   200" "   101" "  1 089" ...
##  $ X1964       : chr  "  3 026" "   204" "   108" "  1 118" ...
##  $ X1965       : chr  "  3 094" "   207" "   115" "  1 148" ...
##  $ X1966       : chr  "  3 170" "   211" "   123" "  1 179" ...
##  $ X1967       : chr  "  3 253" "   216" "   131" "  1 210" ...
##  $ X1968       : chr  "  3 337" "   221" "   140" "  1 243" ...
##  $ X1969       : chr  "  3 414" "   225" "   150" "  1 276" ...
##  $ X1970       : chr  "  3 479" "   230" "   160" "  1 311" ...
##  $ X1971       : chr  "  3 530" "   235" "   169" "  1 347" ...
##  $ X1972       : chr  "  3 570" "   239" "   179" "  1 385" ...
##  $ X1973       : chr  "  3 605" "   244" "   191" "  1 424" ...
##  $ X1974       : chr  "  3 646" "   250" "   205" "  1 464" ...
##  $ X1975       : chr  "  3 701" "   257" "   224" "  1 505" ...
##  $ X1976       : chr  "  3 771" "   266" "   249" "  1 548" ...
##  $ X1977       : chr  "  3 854" "   276" "   277" "  1 592" ...
##  $ X1978       : chr  "  3 949" "   287" "   308" "  1 637" ...
##  $ X1979       : chr  "  4 051" "   297" "   336" "  1 684" ...
##  $ X1980       : chr  "  4 157" "   308" "   359" "  1 733" ...
##  $ X1981       : chr  "  4 267" "   318" "   375" "  1 785" ...
##  $ X1982       : chr  "  4 380" "   327" "   385" "  1 837" ...
##  $ X1983       : chr  "  4 498" "   336" "   394" "  1 891" ...
##  $ X1984       : chr  "  4 621" "   345" "   406" "  1 946" ...
##  $ X1985       : chr  "  4 751" "   355" "   426" "  2 004" ...
##  $ X1986       : chr  "  4 887" "   366" "   454" "  2 065" ...
##  $ X1987       : chr  "  5 027" "   377" "   490" "  2 127" ...
##  $ X1988       : chr  "  5 169" "   388" "   529" "  2 186" ...
##  $ X1989       : chr  "  5 307" "   400" "   564" "  2 231" ...
##  $ X1990       : chr  "  5 439" "   412" "   590" "  2 259" ...
##  $ X1991       : chr  "  5 565" "   424" "   607" "  2 266" ...
##  $ X1992       : chr  "  5 686" "   436" "   615" "  2 258" ...
##  $ X1993       : chr  "  5 798" "   449" "   619" "  2 239" ...
##  $ X1994       : chr  "  5 899" "   462" "   622" "  2 218" ...
##  $ X1995       : chr  "  5 987" "   475" "   630" "  2 204" ...
##  $ X1996       : chr  "  6 060" "   489" "   644" "  2 196" ...
##  $ X1997       : chr  "  6 122" "   502" "   661" "  2 195" ...
##  $ X1998       : chr  "  6 186" "   515" "   680" "  2 206" ...
##  $ X1999       : chr  "  6 267" "   529" "   700" "  2 237" ...
##  $ X2000       : chr  "  6 379" "   542" "   718" "  2 292" ...
##  $ X2001       : chr  "  6 526" "   556" "   733" "  2 375" ...
##  $ X2002       : chr  "  6 704" "   569" "   747" "  2 481" ...
##  $ X2003       : chr  "  6 909" "   583" "   760" "  2 601" ...
##  $ X2004       : chr  "  7 132" "   597" "   772" "  2 720" ...
##  $ X2005       : chr  "  7 365" "   612" "   783" "  2 827" ...
##  $ X2006       : chr  "  7 608" "   626" "   795" "  2 918" ...
##  $ X2007       : chr  "  7 862" "   642" "   805" "  2 997" ...
##  $ X2008       : chr  "  8 126" "   657" "   816" "  3 063" ...
##  $ X2009       : chr  "  8 398" "   673" "   828" "  3 120" ...
##  $ X2010       : chr  "  8 676" "   690" "   840" "  3 170" ...
##  $ X2011       : chr  "  8 958" "   707" "   854" "  3 214" ...
##  $ X2012       : chr  "  9 246" "   724" "   868" "  3 250" ...
##  $ X2013       : chr  "  9 540" "   742" "   883" "  3 281" ...
##  $ X2014       : chr  "  9 844" "   759" "   899" "  3 311" ...
##  $ X2015       : chr  "  10 160" "   777" "   914" "  3 343" ...
##  $ X2016       : chr  "  10 488" "   796" "   929" "  3 377" ...
##  $ X2017       : chr  "  10 827" "   814" "   944" "  3 413" ...
##  $ X2018       : chr  "  11 175" "   832" "   959" "  3 453" ...
##  $ X2019       : chr  "  11 531" "   851" "   974" "  3 497" ...
##  $ X2020       : chr  "  11 891" "   870" "   988" "  3 546" ...
## tibble [177 × 11] (S3: sf/tbl_df/tbl/data.frame)
##  $ iso_a2   : chr [1:177] "FJ" "TZ" "EH" "CA" ...
##  $ name_long: chr [1:177] "Fiji" "Tanzania" "Western Sahara" "Canada" ...
##  $ continent: chr [1:177] "Oceania" "Africa" "Africa" "North America" ...
##  $ region_un: chr [1:177] "Oceania" "Africa" "Africa" "Americas" ...
##  $ subregion: chr [1:177] "Melanesia" "Eastern Africa" "Northern Africa" "Northern America" ...
##  $ type     : chr [1:177] "Sovereign country" "Sovereign country" "Indeterminate" "Sovereign country" ...
##  $ area_km2 : num [1:177] 19290 932746 96271 10036043 9510744 ...
##  $ pop      : num [1:177] 885806 52234869 NA 35535348 318622525 ...
##  $ lifeExp  : num [1:177] 70 64.2 NA 82 78.8 ...
##  $ gdpPercap: num [1:177] 8222 2402 NA 43079 51922 ...
##  $ geom     :sfc_MULTIPOLYGON of length 177; first list element: List of 3
##   ..$ :List of 1
##   .. ..$ : num [1:5, 1:2] -180 -180 -180 -180 -180 ...
##   ..$ :List of 1
##   .. ..$ : num [1:9, 1:2] 178 178 177 177 178 ...
##   ..$ :List of 1
##   .. ..$ : num [1:8, 1:2] 180 180 179 179 179 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geom"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA
##   ..- attr(*, "names")= chr [1:10] "iso_a2" "name_long" "continent" "region_un" ...

Clean Data

First, we need to drop unnecessary columns and set data types (factor, numeric, etc.).
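
As one example of what this cleaning involves, the UN population columns load as character strings with embedded spaces (for example "  5 439"), one column per year. The sketch below shows one way to convert them to numbers and reshape them to one row per country and year using base R only. The object names pop_wide and pop_long are assumptions for illustration, not the names used in our chunks; the four values are copied from the loaded data above.

```r
# Two countries and two year columns copied from the str() output above.
pop_wide <- data.frame(
  Country = c("Burundi", "Comoros"),
  Country.code = c(108L, 174L),
  X1990 = c("  5 439", "   412"),
  X1991 = c("  5 565", "   424"),
  stringsAsFactors = FALSE
)

# Year columns are the ones named X followed by four digits.
year_cols <- grep("^X[0-9]{4}$", names(pop_wide), value = TRUE)

# Build one long block per year column, then stack the blocks.
pop_long <- do.call(rbind, lapply(year_cols, function(col) {
  data.frame(
    Country = pop_wide$Country,
    M49.code = pop_wide$Country.code,
    Year = sub("^X", "", col),
    # Strip everything except digits and the decimal point before converting.
    Population.thousands = as.numeric(gsub("[^0-9.]", "", pop_wide[[col]])),
    stringsAsFactors = FALSE
  )
}))

# Set the categorical columns to factors.
pop_long$Country <- factor(pop_long$Country)
pop_long$M49.code <- factor(pop_long$M49.code)
pop_long$Year <- factor(pop_long$Year)
str(pop_long)
```

Stripping all non-numeric characters, rather than only ordinary spaces, also handles non-breaking spaces that spreadsheets sometimes use as thousands separators.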

Clean air_pollution_df:

## 'data.frame':    6468 obs. of  4 variables:
##  $ Country                      : Factor w/ 231 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ISO.alpha3.code              : Factor w/ 197 levels "","AFG","AGO",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Year                         : Factor w/ 28 levels "1990","1991",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Deaths.Air.Pollution.per.100k: num  299 291 279 279 287 ...

Clean gdp_df:

## tibble [12,401 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Country        : Factor w/ 259 levels "Afghanistan",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ ISO.alpha3.code: Factor w/ 259 levels "ABW","AFG","AGO",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year           : Factor w/ 60 levels "1960","1961",..: 27 28 29 30 31 32 33 34 35 36 ...
##  $ GDP.USD        : num [1:12401] 405463417 487602458 596423607 695304363 764887117 ...

Clean population_region_df:

## 'data.frame':    16685 obs. of  6 variables:
##  $ SDGRegion           : Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ SubRegion           : Factor w/ 22 levels "AUSTRALIA/NEWZEALAND",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Country             : Factor w/ 235 levels "Afghanistan",..: 34 34 34 34 34 34 34 34 34 34 ...
##  $ M49.code            : Factor w/ 235 levels "100","104","108",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Year                : Factor w/ 71 levels "1950","1951",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Population.thousands: num  2309 2360 2406 2449 2492 ...

Clean the ISO country code data frame:

## 'data.frame':    249 obs. of  4 variables:
##  $ Country.or.Area: Factor w/ 249 levels "Afghanistan",..: 6 234 1 10 8 3 12 7 9 11 ...
##  $ ISO.alpha2.code: Factor w/ 248 levels "AD","AE","AF",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ ISO.alpha3.code: Factor w/ 249 levels "ABW","AFG","AGO",..: 7 8 2 14 4 6 10 3 12 9 ...
##  $ M49.code       : Factor w/ 249 levels "4","8","10","12",..: 6 225 1 8 190 2 16 7 3 10 ...

Clean world:

## tibble [177 × 2] (S3: sf/tbl_df/tbl/data.frame)
##  $ iso_a2: Factor w/ 175 levels "AE","AF","AL",..: 53 162 48 26 165 89 167 124 72 7 ...
##  $ geom  :sfc_MULTIPOLYGON of length 177; first list element: List of 3
##   ..$ :List of 1
##   .. ..$ : num [1:5, 1:2] -180 -180 -180 -180 -180 ...
##   ..$ :List of 1
##   .. ..$ : num [1:9, 1:2] 178 178 177 177 178 ...
##   ..$ :List of 1
##   .. ..$ : num [1:8, 1:2] 180 180 179 179 179 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geom"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA
##   ..- attr(*, "names")= chr "iso_a2"

Note that we only have geometries for 175 distinct country codes, so some countries cannot be plotted on a map. That is acceptable for our purposes.

Final DataFrame Construction

Now let’s merge our 4 datasets into one using a series of inner joins, with country code and year as keys depending on the specific join. We use inner joins because we want to drop rows that would otherwise contain null values. These arise either when a country has no matching country code or when a dataset covers more years than the dataset with the smallest year range (the air pollution dataset, 1990 to 2017).
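
The join logic can be sketched with base R's merge(), which performs an inner join by default. This is a minimal illustration, not the exact code in our chunks: the data frame names are assumptions, the AND (Andorra) rows reuse numbers from the outputs in this report, and the ZZZ row is hypothetical. One thing worth checking: the first gdp.per.capita values in the str() output of the final data frame (38887914, 39428449) equal GDP.USD divided by Population.thousands, which is GDP per thousand people. The sketch multiplies the population by 1,000 first to get a per-person figure. Correlations are unaffected by this constant factor.

```r
# Toy inputs. The AND rows reuse numbers from this report's outputs;
# the ZZZ row is hypothetical and has no match in the other tables.
air <- data.frame(
  ISO.alpha3.code = c("AND", "AND", "ZZZ"),
  Year = c("2012", "2013", "2012"),
  Deaths.Air.Pollution.per.100k = c(17.7, 17.2, 50.0),
  stringsAsFactors = FALSE
)
codes <- data.frame(ISO.alpha3.code = "AND", M49.code = "20",
                    stringsAsFactors = FALSE)
gdp <- data.frame(
  ISO.alpha3.code = c("AND", "AND"),
  Year = c("2012", "2013"),
  GDP.USD = c(3188808943, 3193704343),
  stringsAsFactors = FALSE
)
pop <- data.frame(
  M49.code = c("20", "20"),
  Year = c("2012", "2013"),
  Population.thousands = c(82, 81),
  stringsAsFactors = FALSE
)

# merge() keeps only rows with matching keys on both sides (all = FALSE).
joined <- merge(air, codes, by = "ISO.alpha3.code")
joined <- merge(joined, gdp, by = c("ISO.alpha3.code", "Year"))
joined <- merge(joined, pop, by = c("M49.code", "Year"))

# Population is in thousands, so scale it to people before dividing.
joined$gdp.per.capita <- joined$GDP.USD / (joined$Population.thousands * 1000)
print(joined)
```

The unmatched ZZZ row drops out at the first join, which is the behavior we rely on to remove entries without a country code.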

## 'data.frame':    5197 obs. of  12 variables:
##  $ ISO.alpha2.code              : Factor w/ 248 levels "AD","AE","AF",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ M49.code                     : Factor w/ 249 levels "4","8","10","12",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Year                         : Factor w/ 28 levels "1990","1991",..: 23 24 1 2 3 4 5 6 7 8 ...
##  $ ISO.alpha3.code              : Factor w/ 197 levels "","AFG","AGO",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Country.x                    : Factor w/ 231 levels "Afghanistan",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Deaths.Air.Pollution.per.100k: num  17.7 17.2 29 28.7 28.5 ...
##  $ GDP.USD                      : num  3188808943 3193704343 1029048482 1106928583 1210013652 ...
##  $ SDGRegion                    : Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ SubRegion                    : Factor w/ 22 levels "AUSTRALIA/NEWZEALAND",..: 19 19 19 19 19 19 19 19 19 19 ...
##  $ Population.thousands         : num  82 81 55 57 59 61 63 64 64 64 ...
##  $ geom                         :sfc_MULTIPOLYGON of length 5197; first list element:  list()
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  $ gdp.per.capita               : num  38887914 39428449 18709972 19419800 20508706 ...

Our dataset is finally ready to be analyzed.

4. EDA - Exploratory Data Analysis

Quick Plots

Let’s start our EDA with some quick plots of the distribution of the data.

Histogram of Air Pollution Induced Deaths, Population, and GDP per Capita

Figures 5, 6, and 7: Histograms of Air Pollution Induced Deaths, Population, and GDP per Capita.

Deaths.Air.Pollution.per.100k, Population.thousands, and gdp.per.capita all appear to be non-normal and right-skewed.
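
The visual impression of skew can be backed with a number. The sketch below is a hedged illustration on simulated data, not on final_df: sample_skewness is a helper defined here for the purpose, and toy_values is a hypothetical right-skewed stand-in for a column such as gdp.per.capita.

```r
# Sample skewness: third central moment divided by the cubed standard deviation.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

set.seed(42)
# Hypothetical right-skewed data (log-normal), used only for illustration.
toy_values <- rlnorm(500, meanlog = 8, sdlog = 1)

skew_raw <- sample_skewness(toy_values)       # positive for right-skewed data
skew_log <- sample_skewness(log(toy_values))  # near zero after a log transform

# Shapiro-Wilk test: a small p-value is evidence against normality.
# shapiro.test() accepts between 3 and 5000 values, so a full 5197-row
# column would need to be sampled before testing.
p_raw <- shapiro.test(toy_values)$p.value

print(c(skew_raw = skew_raw, skew_log = skew_log, p_raw = p_raw))
```

A skewness well above zero that shrinks toward zero after a log transform would suggest analyzing the variable on a log scale or using rank-based methods.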

Boxplot of Air Pollution Induced Deaths, Population, and GDP per Capita

Let’s look at a boxplot for the outliers.

Figure 8: Boxplot of Deaths per 100,000 from Air Pollution vs SDG Region

It is interesting to note that Australia/New Zealand, Europe, and North America appear to have the lowest deaths per 100k from air pollution, and their values are tightly clustered (low variance) relative to other regions around the world. Furthermore, these regions contain many of the most economically developed countries.

Let’s take another look but at SubRegions.

Figure 9: Boxplot of Deaths per 100k from Air Pollution vs Sub Region

Breaking the regions down into more granular sub-regions shows a similar trend: Australia/New Zealand, North America, Northern Europe, and Western Europe all have low deaths per 100k with low variance. These sub-regions consist of countries that were historically considered ‘First World’ before 1990, the first year of our analysis. We will dig into this more later in our SMART questions.

What does the GDP per capita of these regions look like comparatively? Let’s take a look.

Figure 10: Boxplot of GDP per Capita vs Sub Region

It is interesting to observe that the same sub-regions that have low deaths caused by air pollution also have comparatively high GDP per capita. We will try to quantify this relationship later in our main research analysis.

Map of Countries

Plotting maps and maps with intensities will be useful for us to visualize our data and the results of our analysis.

Figure 11: Global Map of SDGRegions and SubRegions
Figure 12: Global Intensity Map of Key Numerical Features, 1990 to 2017

There appears to be an inverse relationship between gdp.per.capita and Deaths.Air.Pollution.per.100k.

We can also use ggplot2 to have a bit more control over map plotting.

Figure 13: Global Intensity Map of Deaths due to Air Pollution per 100k People, 1990 to 2017
Figure 14: Intensity Map of Deaths due to Air Pollution per 100k People in East and Southeastern Asia, 2017

SMART Questions

1. Is there a relationship between population size and Deaths per 100,000 due to air pollution?

Below, we measure the relationship between population size (in thousands) and deaths per 100,000 due to air pollution. Because both variables are numerical, we first check whether each is normally distributed using Q-Q plots. Since the histograms above suggest that neither variable is normal, we use Spearman’s rank correlation, which does not assume normality. The result (0.037) indicates essentially no correlation between a country’s population size and its deaths due to air pollution. We do, however, observe a moderate negative correlation (-0.543 in the correlation matrix) between deaths due to air pollution and GDP per capita.

str(final_df)
## 'data.frame':    5197 obs. of  12 variables:
##  $ ISO.alpha2.code              : Factor w/ 248 levels "AD","AE","AF",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ M49.code                     : Factor w/ 249 levels "4","8","10","12",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Year                         : Factor w/ 28 levels "1990","1991",..: 23 24 1 2 3 4 5 6 7 8 ...
##  $ ISO.alpha3.code              : Factor w/ 197 levels "","AFG","AGO",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Country.x                    : Factor w/ 231 levels "Afghanistan",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Deaths.Air.Pollution.per.100k: num  17.7 17.2 29 28.7 28.5 ...
##  $ GDP.USD                      : num  3188808943 3193704343 1029048482 1106928583 1210013652 ...
##  $ SDGRegion                    : Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ SubRegion                    : Factor w/ 22 levels "AUSTRALIA/NEWZEALAND",..: 19 19 19 19 19 19 19 19 19 19 ...
##  $ Population.thousands         : num  82 81 55 57 59 61 63 64 64 64 ...
##  $ geom                         :sfc_MULTIPOLYGON of length 5197; first list element:  list()
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  $ gdp.per.capita               : num  38887914 39428449 18709972 19419800 20508706 ...
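# Check normality with Q-Q plots; points far from the reference line suggest non-normal data
qqnorm(final_df$Population.thousands)
qqline(final_df$Population.thousands)

qqnorm(final_df$Deaths.Air.Pollution.per.100k)
qqline(final_df$Deaths.Air.Pollution.per.100k)

# Spearman rank correlation does not assume normality
cor(final_df$Population.thousands, final_df$Deaths.Air.Pollution.per.100k, method = "spearman")
## [1] 0.037
# Correlation matrix (cor() uses Pearson by default)
pop_poll_cor <- cor(select(final_df, Deaths.Air.Pollution.per.100k, Population.thousands, gdp.per.capita))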
pop_poll_cor
##                               Deaths.Air.Pollution.per.100k
## Deaths.Air.Pollution.per.100k                         1.000
## Population.thousands                                  0.069
## gdp.per.capita                                       -0.543
##                               Population.thousands gdp.per.capita
## Deaths.Air.Pollution.per.100k                0.069         -0.543
## Population.thousands                         1.000         -0.040
## gdp.per.capita                              -0.040          1.000
xkabledply(pop_poll_cor)
Table
Variable | Deaths.Air.Pollution.per.100k | Population.thousands | gdp.per.capita
Deaths.Air.Pollution.per.100k | 1.000 | 0.069 | -0.543
Population.thousands | 0.069 | 1.000 | -0.040
gdp.per.capita | -0.543 | -0.040 | 1.000
#plot
loadPkg("corrplot")
corrplot(pop_poll_cor)
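
To attach a significance test to a rank correlation, cor.test() can be used in place of cor(). The sketch below uses six made-up observations (not rows of final_df) in which deaths fall steadily as GDP per capita rises, to show how Spearman's coefficient captures a monotone but non-linear relationship that Pearson's coefficient understates.

```r
# Hypothetical values, chosen so deaths decrease monotonically with GDP per capita.
gdp_per_capita  <- c(500, 1200, 3000, 8000, 20000, 45000)
deaths_per_100k <- c(210, 180, 120, 60, 25, 12)

# Pearson measures linear association only.
pearson_r <- cor(gdp_per_capita, deaths_per_100k)

# Spearman works on ranks, so any strictly monotone relationship gives +1 or -1.
spearman <- cor.test(gdp_per_capita, deaths_per_100k, method = "spearman")
rho <- unname(spearman$estimate)

print(c(pearson = pearson_r, spearman = rho, p_value = spearman$p.value))
```

With real data containing tied values, cor.test() cannot compute an exact p-value for Spearman's method and warns about it; passing exact = FALSE requests the approximation explicitly.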

3. Which regions have the lowest and highest deaths due to air pollution?

library(scales)
# Sum the country-level rates (deaths per 100k) by year and SDG region
deathsbyyear_reg <- group_by(.data = final_df, Year, SDGRegion)
totdeath_reg <- summarize(.data = deathsbyyear_reg, total = sum(Deaths.Air.Pollution.per.100k, na.rm = TRUE))
str(totdeath_reg)
## tibble [252 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ Year     : Factor w/ 28 levels "1990","1991",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ SDGRegion: Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 1 2 3 4 5 6 7 8 9 1 ...
##  $ total    : num [1:252] 50.5 1759.9 1290.9 1540.6 2272.3 ...
##  - attr(*, "groups")= tibble [28 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ Year : Factor w/ 28 levels "1990","1991",..: 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ .rows: list<int> [1:28] 
##   .. ..$ : int [1:9] 1 2 3 4 5 6 7 8 9
##   .. ..$ : int [1:9] 10 11 12 13 14 15 16 17 18
##   .. ..$ : int [1:9] 19 20 21 22 23 24 25 26 27
##   .. ..$ : int [1:9] 28 29 30 31 32 33 34 35 36
##   .. ..$ : int [1:9] 37 38 39 40 41 42 43 44 45
##   .. ..$ : int [1:9] 46 47 48 49 50 51 52 53 54
##   .. ..$ : int [1:9] 55 56 57 58 59 60 61 62 63
##   .. ..$ : int [1:9] 64 65 66 67 68 69 70 71 72
##   .. ..$ : int [1:9] 73 74 75 76 77 78 79 80 81
##   .. ..$ : int [1:9] 82 83 84 85 86 87 88 89 90
##   .. ..$ : int [1:9] 91 92 93 94 95 96 97 98 99
##   .. ..$ : int [1:9] 100 101 102 103 104 105 106 107 108
##   .. ..$ : int [1:9] 109 110 111 112 113 114 115 116 117
##   .. ..$ : int [1:9] 118 119 120 121 122 123 124 125 126
##   .. ..$ : int [1:9] 127 128 129 130 131 132 133 134 135
##   .. ..$ : int [1:9] 136 137 138 139 140 141 142 143 144
##   .. ..$ : int [1:9] 145 146 147 148 149 150 151 152 153
##   .. ..$ : int [1:9] 154 155 156 157 158 159 160 161 162
##   .. ..$ : int [1:9] 163 164 165 166 167 168 169 170 171
##   .. ..$ : int [1:9] 172 173 174 175 176 177 178 179 180
##   .. ..$ : int [1:9] 181 182 183 184 185 186 187 188 189
##   .. ..$ : int [1:9] 190 191 192 193 194 195 196 197 198
##   .. ..$ : int [1:9] 199 200 201 202 203 204 205 206 207
##   .. ..$ : int [1:9] 208 209 210 211 212 213 214 215 216
##   .. ..$ : int [1:9] 217 218 219 220 221 222 223 224 225
##   .. ..$ : int [1:9] 226 227 228 229 230 231 232 233 234
##   .. ..$ : int [1:9] 235 236 237 238 239 240 241 242 243
##   .. ..$ : int [1:9] 244 245 246 247 248 249 250 251 252
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE
# Line chart of the summed country-level rates for each SDG region
deaths_line <- ggplot(data = totdeath_reg,
                      mapping = aes(x = Year, y = total, group = SDGRegion, color = SDGRegion)) +
  geom_line(size = 1.2) +
  geom_point(size = 1.2) +
  scale_y_continuous(labels = comma, limits = c(0, 15000), breaks = seq(0, 15000, 1500)) +
  labs(title = "Deaths per 100,000 by Air Pollution, by Region",
       subtitle = "1990 - 2017",
       caption = "Data Source: Kaggle",
       y = "Deaths due to Air Pollution",
       x = "") +
  theme_minimal() +
  theme(axis.title = element_text(size = 8, face = "bold"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        panel.grid.major.y = element_blank(),
        axis.line.x = element_line(color = "black"),
        axis.ticks = element_line(color = "black"),
        axis.text = element_text(size = 10),
        legend.text = element_text(size = 5),
        # Title, subtitle, and caption are left-aligned (hjust = 0)
        plot.title = element_text(color = "black", size = 12, face = "bold", hjust = 0, margin = margin(b = 10)),
        plot.subtitle = element_text(color = "black", size = 10, hjust = 0),
        plot.caption = element_text(color = "black", size = 8, face = "italic", hjust = 0))
deaths_line
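
The chart above plots the sum of country-level rates within each region, so a region with more countries will tend to show a larger total even if its individual rates are low. A population-weighted rate is one alternative that is comparable across regions. The sketch below contrasts the three summaries on made-up numbers; the regions A and B, the rates, and the populations are all hypothetical.

```r
# Hypothetical data: region A has two countries, region B has one.
toy <- data.frame(
  SDGRegion = c("A", "A", "B"),
  rate = c(100, 20, 50),       # deaths per 100k
  pop = c(1000, 9000, 5000),   # population in thousands
  stringsAsFactors = FALSE
)

# Compute each summary per region, then stack the one-row results.
by_region <- do.call(rbind, lapply(split(toy, toy$SDGRegion), function(d) {
  data.frame(
    SDGRegion = d$SDGRegion[1],
    sum_of_rates = sum(d$rate),                       # what the chart plots
    mean_rate = mean(d$rate),                         # unweighted average
    weighted_rate = sum(d$rate * d$pop) / sum(d$pop)  # population-weighted
  )
}))
print(by_region)
```

In this toy case region A has a summed rate of 120 but a population-weighted rate of 28, below region B's 50, so the ranking of regions can depend on which summary is used.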

4. How have deaths due to air pollution changed over time? More specifically, are death rates in the most recent X years higher than death rates in earlier groups of X years?

library(scales)
#Aggregate data by total deaths by year
deathsbyyear <- group_by(.data = final_df,Year)
totdeath <- summarize(.data = deathsbyyear, tot_deaths = sum(Deaths.Air.Pollution.per.100k, na.rm = TRUE))
str(totdeath)
## tibble [28 × 2] (S3: tbl_df/tbl/data.frame)
##  $ Year      : Factor w/ 28 levels "1990","1991",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ tot_deaths: num [1:28] 18161 17803 17927 18133 18134 ...
deaths_line <- ggplot() +
               geom_line(data = totdeath, mapping = aes(x = Year, y = tot_deaths, group = 1), size = 1.2) +
              geom_point(data = totdeath, mapping = aes(x = Year, y = tot_deaths, group = 1), size = 1.2) +
              scale_color_manual(values = c("darkmagenta")) +
              scale_y_continuous(label = comma, limits = c(0, 40000), breaks = seq(0,40000,10000)) 
             # scale_x_discrete(limits = c(1990, 2017), breaks = seq(1990,2017,2)) 

deaths_line <- deaths_line + labs(title = "Deaths per 1000,000 by Air Pollution",
                                            subtitle = "1990 - 2017", 
                                            caption = "Data Source: Kaggle",
                                            y = "Deaths due to Air Pollution",
                                            x = "") +
              theme_minimal() +
              theme(axis.title = element_text(size = 8, face = "bold"),
                    panel.grid.major.x = element_blank(),
                    panel.grid.minor = element_blank(),
                    panel.background = element_blank(),
                    panel.grid.major.y = element_blank(),
                    axis.line.x = element_line(color = "black"),
                    axis.ticks = element_line(color = "black"),
                    axis.text = element_text(size = 10),
                    legend.position = "none",
                    plot.subtitle = element_text(size = 8),
                    plot.title = element_text(size = 10, margin = margin(b = 10)))
            
            deaths_line <- deaths_line + theme(plot.title = element_text(color = "black", size = 12, face = "bold", hjust = 0),
                                         plot.subtitle = element_text(color = "black", size = 10, hjust = 0 ),
                                         plot.caption = element_text(color = "black", size =8, face = "italic", hjust =0))
            deaths_line
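
The plot shows the yearly trend but does not by itself answer the "recent X years versus earlier X years" question. Below is a minimal sketch of one way to make that comparison, assuming X = 7 so that 1990 to 2017 splits into four equal periods. The data frame `toy` and its values are made up as a stand-in for `totdeath`, and the names `period` and `period_means` are illustrative, not part of the original analysis.

```r
# Hypothetical stand-in for totdeath: one row per year, 1990-2017.
toy <- data.frame(
  Year = 1990:2017,
  tot_deaths = seq(18000, 12600, length.out = 28)  # made-up values
)

# Assign each year to a 7-year period (assumption: X = 7).
toy$period <- cut(
  toy$Year,
  breaks = seq(1989, 2017, by = 7),
  labels = c("1990-1996", "1997-2003", "2004-2010", "2011-2017")
)

# Mean of the yearly totals within each period.
period_means <- tapply(toy$tot_deaths, toy$period, mean)
print(period_means)
```

With the real data, `Year` is a factor, so it would first need converting with `as.integer(as.character(Year))` before calling `cut()`.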

## 'data.frame':    6468 obs. of  7 variables:
##  $ Entity                                         : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Code                                           : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ Year                                           : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ Air.pollution..total...deaths.per.100.000.     : num  299 291 279 279 287 ...
##  $ Indoor.air.pollution..deaths.per.100.000.      : num  250 243 232 232 239 ...
##  $ Outdoor.particulate.matter..deaths.per.100.000.: num  46.4 46 44.2 44.4 45.6 ...
##  $ Outdoor.ozone.pollution..deaths.per.100.000.   : num  5.62 5.6 5.61 5.66 5.72 ...

5. Main Research Question

Do lower GDP countries have more deaths per 100k due to air pollution?

Is there a correlation between GDP per capita and deaths caused by pollution? Is it linear? How strong is the correlation?

Linear Fit

Let’s first look at the general fit on the overall data.

Fig XX: Linear model (fit1) on overall data, deaths due to air pollution per 100k vs GDP per capita, 1990 to 2017.

From the plot, we observe that there is indeed a negative correlation between deaths due to air pollution per 100k and GDP per capita. However, that relationship is not particularly strong, as the R2 is low at 0.295. This means that only about 30% of the variance in deaths due to air pollution per 100k is explained by GDP per capita in a linear relationship.
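
The code for fit1 is not shown in this section. The sketch below shows how such a fit and its R2 could be obtained, using simulated data only: the data frame `sim`, its columns, and the coefficients used to generate it are all made up for illustration.

```r
set.seed(1)
# Simulated data with a curved, decreasing relationship (made up).
sim <- data.frame(gdp_pc = runif(500, min = 500, max = 60000))
sim$deaths_100k <- exp(8 - 0.4 * log(sim$gdp_pc) + rnorm(500, sd = 0.3))

# Straight-line fit on the raw scale.
fit_lin <- lm(deaths_100k ~ gdp_pc, data = sim)
r2_lin <- summary(fit_lin)$r.squared

# For a one-predictor model, R2 equals the squared Pearson correlation.
r_lin <- cor(sim$gdp_pc, sim$deaths_100k)
print(c(r2 = r2_lin, r_squared = r_lin^2, slope = unname(coef(fit_lin)[2])))
```

R2 measures how much of the variance a straight line accounts for; it does not say that GDP per capita causes the deaths.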

Even looking at each individual SDGRegion, the linear fits get better overall but are still not particularly strong, with the highest being Australia/New Zealand and Europe at R2 of 0.56 and 0.55 respectively.

Fig XX: Linear models for each SDGRegion, deaths due to air pollution per 100k vs GDP per capita, 1990 - 2017.

Let’s now look at how time plays a part.

Fig XX: Linear models for each Year, deaths due to air pollution per 100k vs GDP per capita, 1990 - 2017.

As observed, time does not seem to play a significant part in the relationship between deaths due to air pollution per 100k and GDP per capita, as the R2 stays roughly constant at around 0.3 across all the years.
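
The per-region and per-year fits follow the same pattern: split the data by the grouping column, fit one model per piece, and collect the R2 values. Here is a sketch on simulated data, where the grouping column `grp` stands in for SDGRegion or Year and every name and value is made up:

```r
set.seed(2)
# Simulated data: 3 made-up groups, 100 observations each.
sim <- data.frame(
  grp = rep(c("A", "B", "C"), each = 100),
  gdp_pc = runif(300, min = 500, max = 60000)
)
sim$deaths_100k <- 300 - 0.004 * sim$gdp_pc + rnorm(300, sd = 40)

# Fit one linear model per group and collect each R2.
r2_by_grp <- sapply(split(sim, sim$grp), function(d) {
  summary(lm(deaths_100k ~ gdp_pc, data = d))$r.squared
})
print(round(r2_by_grp, 3))
```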

Transformed Log Scale - Linear Fit

Perhaps we should look at a non-linear fit. From our visuals, we see that every plot starts off at very high deaths due to air pollution per 100k and then drops off dramatically as GDP per capita increases. However, the drop-off begins to taper off and the death rate asymptotically approaches some value. (It would be interesting to see if we can generalize the GDP per capita value at which this happens. Let’s table that for later.) We have seen this type of behavior before in log graphs like the one shown below.

Fig XX: Sample log graph.

Our data looks like -log(x) rather than log(x). Let’s apply log() to both the response and the feature, fit a linear model to the transformed data, and see what the relationship is.

## 
## Call:
## lm(formula = log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita), 
##     data = final_df_sf)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.099 -0.235  0.000  0.206  1.431 
## 
## Coefficients:
##                     Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)         10.07849    0.04871     207 <0.0000000000000002 ***
## log(gdp.per.capita) -0.38952    0.00323    -121 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.369 on 5195 degrees of freedom
## Multiple R-squared:  0.737,  Adjusted R-squared:  0.737 
## F-statistic: 1.45e+04 on 1 and 5195 DF,  p-value: <0.0000000000000002
Fig XX, XX, XX: Fitting to a log(y) = (m)(log(x)) + b curve yields much stronger relationship across the board.

Across the board, the strength of our linear relationship increases dramatically when both features are first transformed by the log() function. The new R2 is 0.737, which means around 74% of the variance in the log of our target feature can be explained by this relationship.
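
Fitting a straight line to log(y) against log(x) is the same as fitting a power law y = a * x^b, with the slope estimating the exponent b. The sketch below illustrates this on simulated data, comparing the raw-scale and log-log fits. The data and the names `fit_raw` and `fit_log` are made up and are not the fits from this report.

```r
set.seed(3)
# Simulated power-law data: y = a * x^b with multiplicative noise (made up).
x <- runif(400, min = 500, max = 60000)
y <- exp(9 - 0.4 * log(x) + rnorm(400, sd = 0.25))

fit_raw <- lm(y ~ x)            # straight line on the raw scale
fit_log <- lm(log(y) ~ log(x))  # straight line on the log-log scale

print(c(raw = summary(fit_raw)$r.squared,
        loglog = summary(fit_log)$r.squared))
print(coef(fit_log))  # slope estimates the exponent b (true value is -0.4 here)
```

Strictly, the two R2 values describe different response scales (y versus log(y)), so the comparison is indicative rather than exact.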

Let’s test a few more regression models adding more features.

## [1] 0.883
## [1] 0.887
## [1] 0.747

R2 values for adding more features are 0.883, 0.887, and 0.747.

Fig XX: VIF of lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita)*SubRegion).
log(gdp.per.capita) log(gdp.per.capita):SubRegionCaribbean log(gdp.per.capita):SubRegionCentralAmerica log(gdp.per.capita):SubRegionCentralAsia log(gdp.per.capita):SubRegionEasternAfrica log(gdp.per.capita):SubRegionEasternAsia log(gdp.per.capita):SubRegionEasternEurope log(gdp.per.capita):SubRegionMelanesia log(gdp.per.capita):SubRegionMicronesia log(gdp.per.capita):SubRegionMiddleAfrica log(gdp.per.capita):SubRegionNorthernAfrica log(gdp.per.capita):SubRegionNORTHERNAMERICA log(gdp.per.capita):SubRegionNorthernEurope log(gdp.per.capita):SubRegionPolynesia log(gdp.per.capita):SubRegionSouth-EasternAsia log(gdp.per.capita):SubRegionSouthAmerica log(gdp.per.capita):SubRegionSouthernAfrica log(gdp.per.capita):SubRegionSouthernAsia log(gdp.per.capita):SubRegionSouthernEurope log(gdp.per.capita):SubRegionWesternAfrica log(gdp.per.capita):SubRegionWesternAsia log(gdp.per.capita):SubRegionWesternEurope SubRegionCaribbean SubRegionCentralAmerica SubRegionCentralAsia SubRegionEasternAfrica SubRegionEasternAsia SubRegionEasternEurope SubRegionMelanesia SubRegionMicronesia SubRegionMiddleAfrica SubRegionNorthernAfrica SubRegionNORTHERNAMERICA SubRegionNorthernEurope SubRegionPolynesia SubRegionSouth-EasternAsia SubRegionSouthAmerica SubRegionSouthernAfrica SubRegionSouthernAsia SubRegionSouthernEurope SubRegionWesternAfrica SubRegionWesternAsia SubRegionWesternEurope
975 7945 5044 3147 9101 2492 5911 2988 2640 5154 3892 3611 5909 1944 5987 7157 3933 5166 7168 9144 9611 5771 6747 3902 2131 5646 2101 4744 2281 2106 3457 2936 3714 5880 1584 4427 5680 3037 3437 6334 5707 8049 5965
Fig XX: VIF of lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita)*SubRegion+Year).
log(gdp.per.capita) log(gdp.per.capita):SubRegionCaribbean log(gdp.per.capita):SubRegionCentralAmerica log(gdp.per.capita):SubRegionCentralAsia log(gdp.per.capita):SubRegionEasternAfrica log(gdp.per.capita):SubRegionEasternAsia log(gdp.per.capita):SubRegionEasternEurope log(gdp.per.capita):SubRegionMelanesia log(gdp.per.capita):SubRegionMicronesia log(gdp.per.capita):SubRegionMiddleAfrica log(gdp.per.capita):SubRegionNorthernAfrica log(gdp.per.capita):SubRegionNORTHERNAMERICA log(gdp.per.capita):SubRegionNorthernEurope log(gdp.per.capita):SubRegionPolynesia log(gdp.per.capita):SubRegionSouth-EasternAsia log(gdp.per.capita):SubRegionSouthAmerica log(gdp.per.capita):SubRegionSouthernAfrica log(gdp.per.capita):SubRegionSouthernAsia log(gdp.per.capita):SubRegionSouthernEurope log(gdp.per.capita):SubRegionWesternAfrica log(gdp.per.capita):SubRegionWesternAsia log(gdp.per.capita):SubRegionWesternEurope SubRegionCaribbean SubRegionCentralAmerica SubRegionCentralAsia SubRegionEasternAfrica SubRegionEasternAsia SubRegionEasternEurope SubRegionMelanesia SubRegionMicronesia SubRegionMiddleAfrica SubRegionNorthernAfrica SubRegionNORTHERNAMERICA SubRegionNorthernEurope SubRegionPolynesia SubRegionSouth-EasternAsia SubRegionSouthAmerica SubRegionSouthernAfrica SubRegionSouthernAsia SubRegionSouthernEurope SubRegionWesternAfrica SubRegionWesternAsia SubRegionWesternEurope Year1991 Year1992 Year1993 Year1994 Year1995 Year1996 Year1997 Year1998 Year1999 Year2000 Year2001 Year2002 Year2003 Year2004 Year2005 Year2006 Year2007 Year2008 Year2009 Year2010 Year2011 Year2012 Year2013 Year2014 Year2015 Year2016 Year2017
987 8010 5074 3170 9188 2516 5949 2997 2662 5201 3917 3613 5952 1951 6051 7192 3955 5204 7232 9195 9704 5774 6800 3922 2144 5695 2121 4771 2285 2122 3486 2952 3717 5922 1587 4473 5704 3051 3458 6390 5730 8125 5967 1.93 1.94 1.95 1.96 1.99 1.99 1.99 1.99 1.99 2.02 2.02 2.05 2.05 2.06 2.07 2.08 2.09 2.11 2.1 2.11 2.12 2.12 2.13 2.13 2.11 2.1 2.11
Fig XX: VIF of lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita)*Year).
log(gdp.per.capita) log(gdp.per.capita):Year1991 log(gdp.per.capita):Year1992 log(gdp.per.capita):Year1993 log(gdp.per.capita):Year1994 log(gdp.per.capita):Year1995 log(gdp.per.capita):Year1996 log(gdp.per.capita):Year1997 log(gdp.per.capita):Year1998 log(gdp.per.capita):Year1999 log(gdp.per.capita):Year2000 log(gdp.per.capita):Year2001 log(gdp.per.capita):Year2002 log(gdp.per.capita):Year2003 log(gdp.per.capita):Year2004 log(gdp.per.capita):Year2005 log(gdp.per.capita):Year2006 log(gdp.per.capita):Year2007 log(gdp.per.capita):Year2008 log(gdp.per.capita):Year2009 log(gdp.per.capita):Year2010 log(gdp.per.capita):Year2011 log(gdp.per.capita):Year2012 log(gdp.per.capita):Year2013 log(gdp.per.capita):Year2014 log(gdp.per.capita):Year2015 log(gdp.per.capita):Year2016 log(gdp.per.capita):Year2017 Year1991 Year1992 Year1993 Year1994 Year1995 Year1996 Year1997 Year1998 Year1999 Year2000 Year2001 Year2002 Year2003 Year2004 Year2005 Year2006 Year2007 Year2008 Year2009 Year2010 Year2011 Year2012 Year2013 Year2014 Year2015 Year2016 Year2017
34.7 187 179 181 177 183 185 186 185 183 185 186 187 187 189 191 194 197 202 209 213 216 219 221 223 224 222 222 187 180 181 177 184 187 189 187 185 187 188 190 192 196 200 205 210 217 223 228 232 236 239 240 240 238 238

Although adding more features to our regression model results in higher R2 values, the Variance Inflation Factors (VIF) for each are extremely high, which indicates that the added features are highly correlated with each other, so we will reject those models. Therefore, we will stick with our second model, fit2.
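
For reference, the VIF of a predictor is 1 / (1 - R2), where R2 comes from regressing that predictor on all the other predictors. The sketch below computes it by hand in base R on made-up predictors. The names `x1`, `x2`, `x3` and `vif_manual` are hypothetical, and this is not necessarily the function used to produce the tables above.

```r
set.seed(4)
n <- 200
# Made-up predictors: x2 is nearly a copy of x1, x3 is independent.
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)
x3 <- rnorm(n)
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing
# predictor j on all the other predictors.
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

Interaction terms are built from their main effects, so very large VIFs for them are partly expected by construction.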

We can then predict a country’s deaths from air pollution per 100k in a given year from its GDP per capita. Because R’s log() is the natural logarithm, the fitted model is:

\[ \ln(Deaths_{from~air~pollution~per~100k|per~year|per~country}) = 10.07849 - 0.38952 * \ln(GDP_{per~capita}) ~~~~~~~~~~~~~~~~ eqn (1) \]

or, exponentiating both sides,

\[ Deaths_{from~air~pollution~per~100k|per~year|per~country} = e^{10.07849 - 0.38952 * \ln(GDP_{per~capita})} ~~~~~~~~~~~~~~~~ eqn (2) \]
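
Because R’s log() is the natural logarithm, back-transforming the fitted log-log model means applying exp(). The sketch below wraps that in a small helper and checks what doubling GDP per capita does to the prediction. The function name `predict_deaths_100k` and the input values are illustrative, and inputs are assumed to be in the same units as the `gdp.per.capita` column.

```r
# Coefficients copied from the fit2 summary above.
b0 <- 10.07849
b1 <- -0.38952

# Back-transform the log-log fit: R's log() is the natural log,
# so the inverse is exp(), giving a power law in GDP per capita.
predict_deaths_100k <- function(gdp_per_capita) {
  exp(b0 + b1 * log(gdp_per_capita))
}

# Effect of doubling GDP per capita: predicted deaths are multiplied
# by 2^b1 regardless of the starting value (inputs are illustrative).
ratio <- predict_deaths_100k(2e6) / predict_deaths_100k(1e6)
print(ratio)
print(2^b1)
```

The ratio is roughly 0.76, so under this model each doubling of GDP per capita is associated with a predicted death rate about 24% lower.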

Is there a difference in mean deaths caused by air pollution between lowest, low, medium, and high GDP per capita groups?

Let’s test whether mean deaths caused by air pollution per 100k differ across GDP per capita levels.

H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp

H1: At least one of \(\mu\)deaths_lowest_gdp, \(\mu\)deaths_low_gdp, \(\mu\)deaths_medium_gdp, \(\mu\)deaths_high_gdp is not equal

Use \(\alpha\) of 0.05.

The p-value for this test is reported as 0e+00 (numerically indistinguishable from zero), which is lower than \(\alpha\) = 0.05. Therefore, we reject our null hypothesis that \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp. This means there is statistically significant evidence that at least one of the mean death rates in the lowest, low, medium, and high GDP per capita groups differs from the others.
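
The code behind this test is not shown. One common way to test equality of several group means is a one-way ANOVA, sketched below on made-up data. Whether the original analysis used `aov()` is an assumption, and the data frame `sim` and its group means are invented for illustration.

```r
set.seed(5)
# Made-up data: 4 GDP groups with different mean death rates.
grp_levels <- c("lowest", "low", "medium", "high")
sim <- data.frame(
  gdp_level = factor(rep(grp_levels, each = 50), levels = grp_levels),
  deaths_100k = rnorm(200, mean = rep(c(150, 110, 70, 40), each = 50), sd = 25)
)

# One-way ANOVA: H0 is that all group means are equal.
fit_aov <- aov(deaths_100k ~ gdp_level, data = sim)
p_value <- summary(fit_aov)[[1]][["Pr(>F)"]][1]
print(p_value < 0.05)
```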

We will conduct four two-sample t-tests:

  • Lowest GDP per capita’s deaths do not equal Low GDP per capita’s deaths
    • H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp
    • H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_low_gdp
  • Low GDP per capita’s deaths do not equal Medium GDP per capita’s deaths
    • H0: \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp
    • H1: \(\mu\)deaths_low_gdp != \(\mu\)deaths_medium_gdp
  • Medium GDP per capita’s deaths do not equal High GDP per capita’s deaths
    • H0: \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp
    • H1: \(\mu\)deaths_medium_gdp != \(\mu\)deaths_high_gdp
  • Lowest GDP per capita’s deaths do not equal High GDP per capita’s deaths
    • H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_high_gdp
    • H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_high_gdp

We will use a two-sample t-test for each, with \(\alpha\) of 0.05.

Test 1:

p-valuetest1: 2.99e-203

p-valuetest1 < \(\alpha\)0.05 = TRUE

Conclusion of test1: p-valuetest1 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_low_gdp and accept our alternative hypothesis.

Test 2:

p-valuetest2: 1.47e-13

p-valuetest2 < \(\alpha\)0.05 = TRUE

Conclusion of test2: p-valuetest2 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_low_gdp is equal to \(\mu\)deaths_medium_gdp and accept our alternative hypothesis.

Test 3:

p-valuetest3: 0e+00

p-valuetest3 < \(\alpha\)0.05 = TRUE

Conclusion of test3: p-valuetest3 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_medium_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.

Test 4:

p-valuetest4: 2.91e-06

p-valuetest4 < \(\alpha\)0.05 = TRUE

Conclusion of test4: p-valuetest4 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.
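
The code behind the four tests is not shown. The sketch below runs one such comparison with `t.test()` on made-up data, then applies a Bonferroni adjustment to the four p-values reported above. The adjustment is an addition for illustration, not part of the original analysis, and the vectors `deaths_lowest` and `deaths_low` are invented.

```r
set.seed(6)
# Made-up death rates per 100k for two GDP groups.
deaths_lowest <- rnorm(60, mean = 150, sd = 30)
deaths_low    <- rnorm(60, mean = 120, sd = 30)

# Welch two-sample t-test (R's default, unequal variances allowed).
res <- t.test(deaths_lowest, deaths_low)
print(res$p.value < 0.05)

# With four tests at alpha = 0.05, a Bonferroni adjustment is one
# simple way to control the family-wise error rate.
p_raw <- c(test1 = 2.99e-203, test2 = 1.47e-13, test3 = 0, test4 = 2.91e-06)
p_adj <- p.adjust(p_raw, method = "bonferroni")
print(p_adj < 0.05)
```

Even after the adjustment, all four reported p-values remain below 0.05, so the conclusions above are unchanged.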

6. Conclusion

From all of our tests, we can confirm that mean deaths caused by air pollution per 100k differ significantly across levels of GDP per capita. This reinforces the idea that deaths caused by air pollution have a significant relationship with GDP per capita, and that relationship can be quantified by Equation 2 (the fitted model uses R’s log(), the natural logarithm, so its inverse is the exponential):

\[ Deaths_{from~air~pollution~per~100k|per~year|per~country} = e^{10.07849 - 0.38952 * \ln(GDP_{per~capita})} ~~~~~~~~~~~~~~~~ eqn (2) \]

7. Bibliography

Fig X: References
Name Link
Making maps with R Link
Geographic data in R Link
Maps in ggplot2 Link
ggplot2 color scales Link
Sample log graph Link